17.5 Intrinsic Methods
261
M[0, 0] = z
for each i in 1 .. S1.length
M[i,0] = f( M[i-1, 0 ], c(S1[i], "_" ) )
-- Boundary
for each j in 1 .. S2.length
M[0,j] = f( M[0, j-1], c("_", S2[j] ) )
-- conditions
for each i in 1 .. S1.length and j in 1 .. S2.length
M[i,j] = g(f(M[i-1, j-1], c(S1[i], S2[j])),
-- (mis)match
f(M[i-1, j ], c(S1[i], "_" )),
-- delete S1[i]
f(M[i,
j-1], c("_",
S2[j])))
-- insert S2[j]
Applied to sequence alignment, two varieties of DPA are in use: the Needleman–
Wunsch (“global alignment”) algorithm, which builds up an alignment starting with
easily achievable alignments of small subsequences, and the Smith–Waterman (“local
alignment”) algorithm that is similar in concept, except that it does not systematically
move through the sequences from one end to the other, but compares subsequences
anywhere.
It is often tacitly assumed that the sequences are random (i.e., incompressible),
but if they are not (i.e., they are compressible to some degree), this should be taken
into account.
There are also some heuristic algorithms (e.g., BLAST and FASTA) that are
faster than the DPAs. They look for matches of short subsequences, which may be
only a few nucleotides or amino acids long, that they then seek to extend. As with
the DPAs, some kind of scoring system has to be used to quantify matches.
Although sequence alignment has become very popular, some of the assump-
tions are quite weak and there is strong motivation to seek alternative methods for
evaluating the degree of kinship between sequences, not based on symbol-by-symbol
comparison; for example, one could evaluate the mutual information between strings
aa and bb (cf. Sects. 7.4.1 and 11.5):
upper I left parenthesis s Subscript a Baseline comma s Subscript b Baseline right parenthesis equals upper I left parenthesis s Subscript b Baseline comma s Subscript a Baseline right parenthesis equals upper I left parenthesis s Subscript a Baseline right parenthesis minus upper I left parenthesis s Subscript a Baseline vertical bar s Subscript b Baseline right parenthesis equals upper I left parenthesis s Subscript b Baseline right parenthesis minus upper I left parenthesis s Subscript b Baseline vertical bar s Subscript a Baseline right parenthesis periodI (sa, sb) = I (sb, sa) = I (sa) −I (sa|sb) = I (sb) −I (sb|sa).
(17.2)
Multiple alignment is an obvious extension of pairwise alignment.
17.5
Intrinsic Methods
The template or intrinsic approach involves constructing concise descriptions of
prototype objects and then identifying genes by searching for matches to such pro-
totypes. An elementary example is searching for motifs (i.e., short subsequences)
known to interact with particular drugs. The motif is often defined more formally
along the lines of a sequence of amino acids that itself defines a substructure in a
protein that can be connected in some way to protein function or structural stability